Aim: update the studies carried out by Regina H. Reynolds about STAR introns’ annotation distributions. Source material: STAR: intron annotation
This report is meant to be used as a template. As such, the different sections will include instructions on how to modify the graphic outputs and not descriptions about the results themselves.
Three different studies will be presented: disease status (Level 1), ataxia subtypes (Level 2) and diagnoses (Level 3). To run this script, it is necessary to source the file at /home/grocamora/Core_Projects/Ataxia_Splicing_Analysis/R/STAR_introns.R with needed helper functions.
# Paths
metadata_path <- file.path(main_path, "metadata/metadata.csv")
multiqc_path <- file.path(main_path, "metadata/multiqc_rseqc_read_distribution.txt")
sj_tab_path <- file.path(main_path, "data/SJ_out_tab_files/")
# Path to the gtf file to annotate the introns and the ENCODE blacklisted region
#gtf_ensembl_path <- file.path("/home/grocamora/RytenLab-Research/Additional_files/Homo_sapiens.GRCh38.105.gtf")
gtf_gencode_path <- file.path("/home/grocamora/RytenLab-Research/Additional_files/GENCODE/gencode.v38.annotation.gtf")
blacklist_path = file.path("/home/grocamora/RytenLab-Research/Additional_files/hg38-blacklist.v2.bed")
# Load the metadata
metadata <- readr::read_delim(metadata_path, show_col_types = FALSE) %>%
dplyr::rowwise() %>%
dplyr::mutate(Individual_ID = stringr::str_split(ID_anon, "_", simplify = T)[1]) %>%
dplyr::filter(!(Diagnosis %in% c("CANVAS", "AIFM1"))) %>%
dplyr::ungroup()Added back graphs by brain tissue.
Fixed wrong data being represented in section 4.1 - Level 1 (Type).
The first step is to load the junctions from the STAR alignment output files (SJ.tab.out). We employ the loadSJ function, an adaptation of the load_sj_df function by Regina. Source.
Since the sj.tab.out file can have different names for different datasets, please modify from the loadSJ function the variable junc_path to match with the name of the files.
# Modify the variable "junc_path" inside the function to match with the specific
# naming scheme of the dataset.
sj_df <- loadSJ(metadata, sj_tab_path)Following is an example of the dataframe generated (minus some columns not employed in the analysis):
| chr | intron_start | intron_end | strand | unique_reads_junction | sample |
|---|---|---|---|---|---|
| chr1 | 117940941 | 117941123 | 1 | 58 | C12A_Frontal |
| chr5 | 32000272 | 32010329 | 1 | 116 | C20B_Frontal |
| chr2 | 174944944 | 174952163 | 2 | 9 | C21B_Cerebellum |
| chr17 | 47808947 | 47809075 | 2 | 5 | CO6A_Cerebellum |
| chr5 | 127824787 | 127827210 | 2 | 1 | CO20B_Frontal |
| chr15 | 101657037 | 101657784 | 2 | 94 | CO8A_Frontal |
| chr14 | 81479741 | 81484224 | 2 | 51 | CO7A_Frontal |
| chr20 | 38434627 | 38435026 | 2 | 13 | C24B_Cerebellum |
| chr12 | 101401174 | 101402864 | 2 | 83 | C20B_Cerebellum |
| chr8 | 38552381 | 38553715 | 2 | 5 | CO6A_Cerebellum |
Next, we create the RSE object with the annotated junctions, the reads matrix and the sample metadata. We employ the function createRSE, upgraded and adapted from the function create_rse_jx by Regina. Source.
The main difference is that only junctions from chromosomes 1:22, X and Y are considered (can be modified in the function with the variable valid_seqnames).
Given that the annotation process is long, we only create the variable if it does not exists in disk.
rse_jx_path <- file.path(main_path, "variables/rse_jx_gencode38.rds")
if(!file.exists(rse_jx_path)){
rse_jx <- createRSE(sj_df, gtf_gencode_path, sample_info = metadata, sample_id_col = "ID_anon")
rse_jx %>% saveRDS(rse_jx_path)
}else{
rse_jx <- readRDS(rse_jx_path)
}This involves:
Removing all junctions that overlap with the ENCODE blacklisted regions.
Removing all junctions that are associated to multiple genes (this removes all junctions from the category ambig. gene and some of the annotated).
Removing all junctions that are less than 25 bp long.
encode_blacklist_hg38 <- loadEncodeBlacklist(blacklist_path)
rse_jx <- removeEncodeBlacklistRegionsRSE(rse_jx, encode_blacklist_hg38)
rse_jx <- removeAmbiguousGenesRSE(rse_jx)
rse_jx <- removeShortJunctionsRSE(rse_jx)The first plots shown here represent the distributions using all categories from the dasper annotation function. As mentioned, results are split in the different levels of study.
Across all visualizations, a Wilcoxon signed-rank test between the distributions were executed, and significant results (\(p<0.05\)) are shown in the graphs. The p-values are Bonferroni corrected given the number of tests executed per group.
First, we plot the category percentages by sample. We employ the function plotJunctionCategories, which allows to select a different level to group by (three different possibilities: Type, AtaxiaSubtype and Diagnosis). Source.
#' Plots the category percentages by sample
#'
#' @param annot_junc_prop_df dataframe, contains the proportion of junctions for
#' each category and each sample.
#' @param level character vector, specifies the level of study. Name of the
#' column that will split the graph and the Wilcoxon tests. For the Ataxia
#' RNAseq data, only "Type", "AtaxiaSubtype" and "Diagnosis" are valid inputs.
#' @param split_tissue boolean, whether to split the graph in facets by tissue.
#' @param ref_group character vector, category in the \code{level} field that
#' will be compare against for the Wilcoxon signed-rank test.
#'
#' @return
#' @export
plotJunctionCategories <- function(annot_junc_prop_df, level, split_tissue = F, ref_group = NULL){
p <- ggplot(annot_junc_prop_df,
aes(x = type, y = prop_junc, fill = !!sym(level))) +
geom_boxplot() +
scale_y_continuous(expand = expansion(mult = c(0.01, 0.15)), limits = c(0, NA)) +
labs(x = "", y = "Proportion of junctions") +
scale_fill_manual(values = pal_jco("default", alpha = 0.9)(10)[c(1, 9)]) +
ggpubr::geom_pwc(aes(group = !!sym(level)), label = " p = {p.adj.format}", p.adjust.method = "bonferroni",
p.adjust.by = "group", vjust = -0.5, hide.ns = T, ref.group = ref_group) +
custom_gg_theme
if(split_tissue){
p <- p +
facet_wrap(vars(Region), ncol = 1) +
scale_fill_manual(values = pal_jco("default", alpha = 0.9)(10)[c(3, 1, 9, 2, 7)])
}
return(p)
}Figure 3.1: Proportion of junctions by brain tissue.
Figure 3.2: Proportion of junctions by disease status and brain tissue.
Figure 3.3: Proportion of junctions by Ataxia Subtype and tissue.
Figure 3.4: Proportion of junctions by Diagnosis and tissue.
Next, we also represent the proportion of counts (or reads) that correspond to each category when compared against the total number of junction reads. We use the function plotJunctionCountCategories.
#' Plots the proportion of counts for each category and sample
#'
#' @param annot_junc_prop_df dataframe, contains the proportion of counts for
#' each category and sample.
#' @param level character vector, specifies the level of study. Name of the
#' column that will split the graph and the Wilcoxon tests. For the Ataxia
#' RNAseq data, only "Type", "AtaxiaSubtype" and "Diagnosis" are valid inputs.
#' @param tissue character vector, tissue to use to graph. Leave as NULL if no
#' distinction is used.
#' @param ref_group character vector, category in the \code{level} field that
#' will be compare against for the Wilcoxon signed-rank test.
#'
#' @return
#' @export
plotJunctionCountCategories <- function(annot_junc_prop_df, level, tissue = NULL, ref_group = NULL){
# If tissue is provided, we filter the dataframe.
if(!is.null(tissue)) annot_junc_prop_df <- annot_junc_prop_df %>% dplyr::filter(Region == tissue)
p <- ggplot(annot_junc_prop_df, aes(x = !!sym(level), y = prop_count, fill = !!sym(level))) +
geom_boxplot() +
scale_y_continuous(expand = expansion(mult = c(0.1, 0.2)), limits = c(0, NA)) +
labs(x = "", y = "Proportion of junctions") +
scale_fill_manual(values = pal_jco("default", alpha = 0.9)(10)[c(1, 9)]) +
facet_wrap(vars(type), ncol = 3, scales = "free") +
ggpubr::geom_pwc(aes(group = !!sym(level)), ref.group = ref_group, vjust = -0.3, hide.ns = T,
label = " p = {p.adj.format}", p.adjust.method = "bonferroni", p.adjust.by = "group") +
custom_gg_theme_subtitle# + theme(axis.text.x = element_blank(), axis.ticks.length.x = unit(0, "cm"))
if(!is.null(tissue)){
p <- p +
scale_fill_manual(values = pal_jco("default", alpha = 0.9)(10)[c(3, 1, 9, 2, 7)]) +
ggtitle("", subtitle = paste0("Tissue: ", tissue))
}
return(p)
}Figure 3.5: Proportion of counts by brain tissue.
Figure 3.6: Proportion of counts by disease status and brain tissue.
Figure 3.7: Proportion of counts by disease status and brain tissue.
Figure 3.8: Proportion of counts by Ataxia Subtype for Cerebellum tissue.
Figure 3.9: Proportion of counts by Ataxia Subtype for Frontal Cortex tissue.
Figure 3.10: Proportion of counts by Diagnosis for Cerebellum tissue.
Figure 3.11: Proportion of counts by Diagnosis for Frontal Cortex tissue.
Next, the same visualizations are presented but using only three categories: Annotated, Partially annotated and Unannotated.
Figure 4.1: Proportion of junctions by brain tissue.
Figure 4.2: Proportion of junctions by disease status and brain tissue.
Figure 4.3: Proportion of junctions by Ataxia Subtype and tissue.
Figure 4.4: Proportion of junctions by Diagnosis and tissue.
Figure 4.5: Proportion of counts by brain tissue.
Figure 4.6: Proportion of counts by disease status and brain tissue.
Figure 4.7: Proportion of counts by disease status and brain tissue.
Figure 4.8: Proportion of counts by Ataxia Subtype for Cerebellum tissue.
Figure 4.9: Proportion of counts by Ataxia Subtype for Frontal Cortex tissue.
Figure 4.10: Proportion of counts by Diagnosis for Cerebellum tissue.
Figure 4.11: Proportion of counts by Diagnosis for Frontal Cortex tissue.